Ensemble Techniques

Domain: Banking

Author: Abhinav Tyagi

Context:

Leveraging customer information is paramount for most businesses. In the case of a bank, attributes of customers like the ones mentioned below can be crucial in strategizing a marketing campaign when launching a new product.

Data Description:

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

Objective:

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

1. Import the necessary libraries and read the data as a data frame (2 marks)
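A minimal sketch of this step. The file name of the real dataset is not given here, so the example reads a tiny in-memory sample instead; the sample rows and every column name other than `balance` and `y` are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
# Column names other than `balance` and `y` are assumptions.
sample_csv = io.StringIO(
    "age,job,balance,y\n"
    "58,management,2143,no\n"
    "44,technician,29,no\n"
    "33,entrepreneur,2,yes\n"
)

# pd.read_csv accepts a path or any file-like object.
bank_df = pd.read_csv(sample_csv)

print(bank_df.shape)   # (rows, columns)
print(bank_df.dtypes)  # data type of each column
```

With the real file, the `io.StringIO` object would be replaced by the path to the CSV.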

2. Perform basic EDA which should include the following and print out your insights at every step. (20 marks)

a. Shape and data type of the data (1 mark)

b. Check info of the dataset (1 mark)

c. Report statistical summary of the dataset. (3 marks)

d. Check the presence of missing values and impute if there is any (3 marks)

No missing values are present in any column, so no imputation is needed.
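The check behind this finding can be sketched as follows. The toy frame is hypothetical and deliberately contains gaps so that the imputation branch is exercised; median and mode imputation are assumptions, not necessarily what the notebook would have used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column (the real data has none).
toy_df = pd.DataFrame({
    "age": [58, 44, np.nan, 47],
    "balance": [2143.0, 29.0, 2.0, np.nan],
    "job": ["management", None, "entrepreneur", "blue-collar"],
})

# Count missing values per column; all zeros means nothing to impute.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# Assumed strategy: median for numeric columns, most frequent value otherwise.
imputed_df = toy_df.copy()
for col in imputed_df.columns:
    if pd.api.types.is_numeric_dtype(imputed_df[col]):
        imputed_df[col] = imputed_df[col].fillna(imputed_df[col].median())
    else:
        imputed_df[col] = imputed_df[col].fillna(imputed_df[col].mode()[0])
```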

e. Checking the presence of outliers and impute if there is any (3 marks)

It is plausible for the balance in a bank account to be very high, so large Balance values are not treated as outliers. No other column shows outliers.
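One common way to flag candidate outliers is the 1.5 x IQR rule, which is also what a box plot draws. The sketch below assumes that rule and uses made-up balance values; it only flags points, leaving the decision to keep or impute them to domain judgment as above.

```python
import pandas as pd

# Hypothetical balances; the last value is deliberately extreme.
balance = pd.Series(
    [120, 250, 310, 400, 450, 520, 610, 700, 98000], name="balance"
)

# Interquartile range and the usual 1.5 * IQR fences.
q1 = balance.quantile(0.25)
q3 = balance.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Boolean mask of points outside the fences.
outlier_mask = (balance < lower_fence) | (balance > upper_fence)
print(balance[outlier_mask])
```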

f. Report the distribution of independent variables. (2 marks)

g. Check frequency distribution of target feature and comment on your findings. (2 marks)

Ratio of class 0 to class 1 is 0.883 : 0.117 (exactly 0.8830151954170445 : 0.1169848045829555), so the target is strongly skewed toward class 0.
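A ratio like this can be obtained with `value_counts(normalize=True)`. The sketch uses a hypothetical 1000-row target built to mirror the reported 0.883 : 0.117 split, not the real column.

```python
import pandas as pd

# Hypothetical target mirroring the reported class shares.
y = pd.Series([0] * 883 + [1] * 117, name="y")

# Share of each class among all rows.
class_share = y.value_counts(normalize=True)
print(class_share)

# How many class-0 rows there are for every class-1 row.
imbalance_ratio = class_share.loc[0] / class_share.loc[1]
print(round(imbalance_ratio, 2))
```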

h. Perform bivariate analysis using pairplot and mention your findings. (3 marks)

i. Check correlation among independent features and mention if there is any collinearity. (2 marks)
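A minimal sketch of a collinearity check using the pairwise Pearson correlation matrix. The data is synthetic, the column `age_copy` is a made-up feature constructed to be collinear with `age`, and the 0.8 cut-off is an assumption rather than a fixed rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 500
age = rng.normal(40, 10, n_rows)

# Synthetic features; `age_copy` is deliberately almost a multiple of `age`.
features = pd.DataFrame({
    "age": age,
    "balance": rng.normal(1500, 800, n_rows),
    "age_copy": 2 * age + rng.normal(0, 0.5, n_rows),
})

corr = features.corr()

# Collect each unordered pair whose absolute correlation exceeds the cut-off.
threshold = 0.8  # assumed cut-off
collinear_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(collinear_pairs)
```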

3. Prepare the data to train a model – check if data types are appropriate, get rid of the missing values etc. (3 marks)
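A sketch of one possible preparation pipeline: map the target to 0/1, one-hot encode categorical columns, and make a stratified train/test split so both parts keep the class ratio. The frame below is hypothetical, and the encoding choice and the 80/20 split are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 20-row frame: 15 "no" and 5 "yes" targets.
raw_df = pd.DataFrame({
    "age": list(range(25, 45)),
    "job": ["management", "technician", "services", "blue-collar"] * 5,
    "balance": [100 * i for i in range(20)],
    "y": ["no", "no", "no", "yes"] * 5,
})

# Encode the target as integers.
y = raw_df["y"].map({"no": 0, "yes": 1})

# One-hot encode categorical predictors; drop_first avoids a redundant column.
X = pd.get_dummies(raw_df.drop(columns="y"), drop_first=True)

# Stratify on y so train and test keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```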

4. Train a decision tree model, note and comment on its performance across different classification metrics. (5 marks)

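The decision tree reaches high overall accuracy on the target column, but its training accuracy of 1.0 against a testing accuracy of 0.8872 (see the comparison in section 6) indicates overfitting.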
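A minimal sketch of training and scoring a decision tree. It runs on synthetic data from `make_classification` with roughly the same class skew, not on the bank data, so the numbers it prints will differ from the ones reported here; all hyperparameters are library defaults.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with about 88% class 0 and 12% class 1.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.88, 0.12], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# An unpruned tree (no depth limit) can memorise the training set.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
print(f"train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}")

# Per-class precision, recall and F1 on the held-out set.
print(classification_report(y_test, tree.predict(X_test)))
```

Comparing the two accuracies, and reading the class 1 row of the report rather than accuracy alone, is what exposes overfitting and weak minority-class performance.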

5. Build the ensemble models (random forest, bagging classifier, AdaBoost, gradient boosting, and stacking classifier) and compare the results. (15 marks)
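A sketch of how the five ensembles could be built and scored in one loop with scikit-learn. The data is synthetic, the estimator counts are arbitrary, and the base learners and logistic-regression meta-learner chosen for the stacking classifier are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the prepared bank features.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.88, 0.12], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    # BaggingClassifier uses a decision tree as its default base learner.
    "Bagging": BaggingClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
    "Gradient Boost": GradientBoostingClassifier(n_estimators=50, random_state=42),
    # Stacking: base learners' cross-validated predictions feed a meta-learner.
    "Stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=5, random_state=42)),
            ("forest", RandomForestClassifier(n_estimators=25, random_state=42)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

# Fit each model and record (training accuracy, testing accuracy).
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_acc, test_acc) in scores.items():
    print(f"{name:15s} train={train_acc:.4f} test={test_acc:.4f}")
```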

Random Forest

AdaBoost

Gradient Boost

Bagging

6. Compare performances of all the models and comment on your findings. (5 marks)

Model Name      Training Accuracy  Testing Accuracy  Precision  Recall  F1 Score  Support
Decision Tree   1.0                0.8872            0.46       0.47    0.47      1268
Random Forest   0.9601             0.8962            0.50       0.58    0.54      1268
AdaBoost        0.9029             0.9030            0.57       0.28    0.38      1268
Gradient Boost  0.9083             0.9074            0.73       0.18    0.29      1268
Bagging         0.9728             0.9081            0.61       0.34    0.43      1268

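  1. Bagging has the highest testing accuracy (0.9081); its training accuracy (0.9728) is second only to the decision tree's 1.0.
  2. The class skew (about 88% class 0 versus 12% class 1) appears to hurt learning of the minority class.
  3. Training and testing accuracies are high, but precision, recall, and F1 score for target class 1 are not satisfactory.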
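Why high accuracy and poor class 1 metrics can coexist is easy to see with a worked example. The sketch below scores a hypothetical baseline that always predicts class 0 on 1000 made-up rows split 883 : 117, mirroring the class ratio found in the EDA; it is an illustration, not one of the trained models.

```python
# Hypothetical labels mirroring the 0.883 : 0.117 class ratio.
y_true = [0] * 883 + [1] * 117
# Baseline that ignores the features and always predicts class 0.
y_pred = [0] * len(y_true)

# Confusion-matrix counts with class 1 as the positive class.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when no positives are predicted or present.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (
    2 * precision * recall / (precision + recall)
    if (precision + recall)
    else 0.0
)

print(f"accuracy={accuracy:.3f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The baseline scores 0.883 accuracy while finding none of the class 1 rows, so testing accuracies around 0.89 to 0.91 are only a modest gain over doing nothing, and the class 1 precision, recall, and F1 columns are the more informative comparison.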